ABSTRACT

The prediction of periodical time-series remains challenging
due to various types of data distortions and misalignments.
Here, we propose a novel model called Temporal
embedding-enhanced convolutional neural Network (TeNet) to learn
repeatedly-occurring-yet-hidden structural elements in
periodical time-series, called abstract snippets, for predicting
future changes. Our model uses convolutional neural networks
and embeds a time-series with its potential neighbors in the
temporal domain for aligning it to the dominant patterns
in the dataset. The model is robust to distortions and
misalignments in the temporal domain and demonstrates strong
prediction power for periodical time-series.

We conduct extensive experiments and find that the
proposed model shows significant and consistent advantages
over existing methods on a variety of data modalities ranging
from human mobility to household power consumption
records. Empirical results indicate that the model is robust
to various factors such as the number of samples, the variance
of the data and the numerical ranges of the data. The experiments
also verify that the intuition behind the model can be
generalized to multiple data types and applications and promises
significant improvement in prediction performance across
the datasets studied.


1. INTRODUCTION 

The behaviors of many of the world’s inhabitants are
fundamentally bound by the cycle of the sun and the moon,
which creates day and night. This is why, across
the days of an average person, there often exist periodical
patterns in their mobility or, more generally, their behavior
[26, 27]. Utilizing such re-occurring patterns could drastically
benefit various modern ubiquitous applications. For
example, the ability to predict at midday a day’s power
consumption for many individual houses would be profoundly
beneficial for the smart grid in dynamically managing its power
supply resources. In the scenario of smart location
tracking [14, 34] with a replenishable energy budget, the
system either aims to minimize the energy consumption of
location tracking, or attempts to maximize the tracking accuracy
given a fixed energy budget. A crucial challenge involved in
such a smart tracking system is to estimate, at any time of
day, how much further the moving entities will move for the
remainder of the day. Ideally, with a greater estimated
total travel distance, the system will employ a more
conservative sampling strategy (lower sampling frequencies)
to cover as much of the whole trip as possible using the
restricted energy budget, whereas a more aggressive strategy
(higher sampling frequencies) will be favored in the presence
of a smaller estimated total travel distance, so that better
tracking precision is achieved. Clearly, the estimation of
the entity’s daily travel distance using partial information is
a challenging yet crucial ingredient for the system’s success.

Approaches have been proposed to predict generic time-series,
and many of them have capitalized on the phenomenon
that for each individual there often exist re-occurring small
fragments of time (which we call “snippets”) in their histories.
By detecting and reusing such snippets, we are able
to reconstruct a day with the elements from previous
relevant days. We show an example of snippet learning for
daily traveling time prediction, and the difficulties it faces,
by using a commuter’s daily routines. It is worth noting
that throughout the entire paper, we assume that besides
the time-series itself, no other supporting information such as
locations is available to the prediction algorithm. For
example, to predict a day’s travel distance, the algorithm’s
only input is a partial time-series of the distances traveled
in each interval. With a 30-minute interval, the whole day
will have 48 time-series entries, and we aim to use the first
half of them to predict the accumulated travel distance for
the whole day.
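The prediction setup described above can be sketched as follows. This is a minimal illustration only: the helper name `split_day` and the synthetic data are assumptions introduced here, not part of the original work.

```python
import numpy as np

def split_day(day_series, observed_fraction=0.5):
    """Split one day's series into (observed part, whole-day total).

    Hypothetical helper: the model only sees the first part of the day
    and must predict the accumulated value for the whole day.
    """
    day = np.asarray(day_series, dtype=float)
    n_observed = int(len(day) * observed_fraction)
    return day[:n_observed], float(day.sum())

# Synthetic stand-in for one day of travel distances at 30-minute intervals.
rng = np.random.default_rng(0)
day = rng.random(48)

x, y = split_day(day)   # x: first 24 entries, y: total for all 48 entries
```

With 48 half-hour entries per day, `x` holds the 24 entries observed by midday and `y` is the regression target.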

Imagine that a person in our example has two usual routines:
1) on workdays the person goes to work by a particular
bus line that stops outside the apartment at 8 a.m. every day,
and arrives at the workplace around 9 a.m. The person
gets lunch around 12 p.m. at some place near the workplace
every day, and finishes work around 5 p.m.; 2) on weekends
the person prefers going to the beach in the morning
and coming home in the evening. In the ideal case, the
person begins and finishes the same activity at the exact
same time on every workday, and the resulting time-series
for travel distances would be identical across days. With
snippets, a time-series for a workday would then be
transformed into a series of snippets like <walking to bus stop,
ride on bus line, walking into office, working, ...>.

Now to predict, at a certain time of the day (e.g. midday),
how much further the moving object will move for the
remainder of the day, we are left with a simple task. For every
interval of the snippet sequence in the example, if the
current day shows an identical partial time-series for that
interval, the person is likely to be working that day and is
likely to yield the same total travel distance as any other
workday. The same method works for the weekends too.

In reality, such patterns do repeat themselves, only not in
such a perfectly aligned way but instead often on a shifted
timeline and at a differing pace. Instead of having high
coherence at all times between two working days of a person,
in reality a day’s time-series may often be partially similar to
and partially divergent from another day’s, posing a serious
challenge for the aforementioned prediction method. There
are many possible causes which prevent a perfect resemblance
between snippet sequences from happening. For example, the
bus in the morning may be 20 minutes late, or the person
may wait for a coffee and miss the bus he/she is supposed to
take. Then, the person may have a later than usual lunch
at work. Finally, the person on one day decides to do usual
items A/B in the order of B/A. Coupled with the huge number
of non-work-related locations a person could go to and
the numerous possible sequences of visiting them, the
resulting time-series could have a huge variety of distortions
to the regular time-series. In such cases, how to learn
representative snippets and how to use them effectively
remains a major challenge.

To solve this complex problem, we adopt the concept of
snippets but take a step forward and propose a robust learning
and time-series prediction model to systematically reduce
the effect of such distortions. Specifically, we make the
following contributions in this paper:

• We propose a novel regression model, which is based
on convolutional neural networks, to solve the robust
snippets learning and periodical time-series prediction
problem.

• We propose a novel technique called temporal embedding
to improve the classical convolutional neural networks’
capability for learning robust snippets and for
predicting accurately. We design a network layer based
on this concept, devise a complete four-layer network
(TeNet) for regression, and solve the corresponding
backpropagation problem. We also offer a detailed case
study to illustrate the effect of temporal embedding.

• We conduct extensive experiments on 15 individual
datasets representing three data modalities and one
synthetic dataset to evaluate the advantages and
characteristics of the proposed model.

The rest of the paper is organized as follows. In
Section 2 we present the background and relevant literature
of the problem studied. In Section 3 we give the intuition
behind TeNet, describe in detail the technique of temporal
embedding and the other layers of TeNet, and offer solutions to
the backpropagation of TeNet. We then evaluate the proposed
model in Section 4. We conclude our work in the last section.

2. BACKGROUND AND RELATED WORK 


Learning abstract features (with neural networks in many
cases) has been extensively studied in recent years and has
proved effective in many applications. For instance,
numerous studies [15, 18] have shown that deep neural
networks perform well for complex computer vision
classification tasks, while many demonstrate that success can be
achieved with deep learning architectures for audio
classification tasks as well [19, 22]. These well-performing deep
neural networks have a variety of core ideas, ranging from
restricted Boltzmann machines that utilize an energy model
[13, 17], to sparse autoencoders that introduce an
unsupervised “denoising” mechanism to remove insignificant,
noisy signals from data [29], to using convolution as
an effective way to learn representative features robust to
geometric locations of images [18].

The main advantage of such methods is that they have
a strong capability of unravelling the hidden hierarchical
structure of data to derive representative features. Moving
from a shallower architecture to a deeper architecture, these
models progressively detect essential components of the data
from local parts like strokes in human handwriting, to global
compositions such as digits or objects. Among the variations
of neural networks, convolutional networks, inspired by
biological processes, in particular excel in finding such
abstract features that are robust to geometric variations in
images [18]. Interestingly, such advantages of convolutional
neural networks are present not only in vision tasks, but
also in speech recognition and natural language
processing.

Now we consider the periodical time-series prediction problem
for data such as daily traveling distances or daily household
power consumption. To tackle this problem, statistical
models such as autoregression and its variants are
conventionally strongly favored. In the past decade, however,
realizing that there is abstract and structural information
beneath the raw numeric values in the time-series, researchers
have experimented with discovering such patterns by clustering
or “motif” discovery [23, 26, 27]. Though conceptually
similar, these “motifs” are usually concrete subsequences
restricted by specific mathematical definitions, which
differentiates them from the concept of abstract,
representative snippets in our paper. However, how to design
a method that can find abstract patterns as well as predict
future values, and that is meanwhile robust to various temporal
distortions and misalignment, is yet to be answered.
Inspired by the success of convolutional neural networks, we
investigate using convolution-based neural networks to
address this problem.


3. THE MODEL 
3.1 Intuition 

The two main challenges for periodical time-series
prediction are: 1) how to find representative snippets for the
prediction of future changes; and 2) how to minimize the
effect of distortions in the temporal domain and get accurate
regression results. Here we examine the two challenges
separately and propose solutions to them from a neural networks
perspective.

The first challenge, i.e. snippet learning, involves finding
abstract sequences in the training time-series. Naturally
there is an assumption that the snippets should only be of
moderate length. For example, if we were to predict daily
human mobility, a time window from one half-hour to
a few hours would be a reasonable setting, as intuitively
such a period of time should be enough to cover most of the
common trips in daily life. Hence in the prediction model,
we examine such periods of time using a convolutional
approach. We create randomly initialized filters that have a
given, moderate length as the length of the target snippets.
In 2D image classification tasks, filters in convolutional
neural networks are often used as edge detectors, while in ours,
the filters will act as “snippet detectors”. In the training
phase, the weights for the filters will be adjusted during the
backpropagation so that they respond maximally to the
re-occurring and significant components in the training data.

input size     1x6    1x6      2x1x2    1x3
output size    1x6    2x1x2    1x3      1x1

Figure 1: An instance of TeNet for 6-d input. It is composed of a sparsely connected temporal embedding layer, a
convolution/pooling layer with two filters of size 1x3 and pool size 1x2 (following the conventions in constructing convolutional neural
networks, the convolution layer and max-pooling layer are illustrated as a single layer), a fully-connected sigmoid layer that
transforms the feature map from size 2x1x2 into 1x3, and finally an $\ell_1$-regularized least-squares regression layer that yields
the predicted value. $W^{(l)}$ and $b^{(l)}$ are the weights and bias of the connections between layers $l$ and $l+1$. Connections with the same
colors in the convolution layer indicate that those connections share the same weights, and the two shaded areas represent the
two feature maps from the filters. The dimensionalities of the weights, the input and the output for each layer are provided at
the bottom. Biases are not illustrated in this figure.

We then solve the second challenge by adding a “temporal
embedding” operation in the neural network. The temporal
embedding process provides a supervised way of denoising
subspace learning. When dealing with time-series, a
naive technique is to “shift” the training data forward and
backward along the timeline. For example, a shifting routine
with window size 1 would transform a training sample
$x = \langle x_1, x_2, \ldots, x_d \rangle \to y$ into three training samples:
$x = \langle 0, x_1, x_2, \ldots, x_{d-1} \rangle \to y$,
$x = \langle x_2, x_3, \ldots, x_d, 0 \rangle \to y$ and
$x = \langle x_1, x_2, \ldots, x_d \rangle \to y$. Though useful sometimes, this
naive approach introduces heavy noise by including artificial
training samples that may never actually happen in the real
world. It is also unable to benefit cases where the order of the
subsequence is changed. We argue that the naive technique
can evolve into a much more effective approach called temporal
embedding, which integrates mechanisms for removing distortions
into the learning process. With temporal embedding,
two temporally-shifted copies are created for each
sample during the learning process, and then the original
sample and the two shifted copies are encoded into a single
sample so that the processed sample will not only carry
its own information, but also bear a piece of information
for each of its shifted neighbors. Again, the weights for the
encoding are learned in a supervised way during
backpropagation.
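The naive shifting routine discussed above (the baseline that temporal embedding improves on, not the proposed method) can be sketched as follows. The function name `shift_augment` is hypothetical, and zero-padding at the boundary follows the three-sample example in the text.

```python
import numpy as np

def shift_augment(x, y, window=1):
    """Return the original sample plus zero-padded shifted copies.

    For window=1 this yields three samples that share the same target y:
    the original, <0, x1, ..., x_{d-1}> and <x2, ..., xd, 0>.
    Assumes window is smaller than len(x).
    """
    x = np.asarray(x, dtype=float)
    samples = [(x.copy(), y)]
    for k in range(1, window + 1):
        forward = np.concatenate([np.zeros(k), x[:-k]])   # shifted later in time
        backward = np.concatenate([x[k:], np.zeros(k)])   # shifted earlier in time
        samples.append((forward, y))
        samples.append((backward, y))
    return samples

augmented = shift_augment([1.0, 2.0, 3.0, 4.0], y=10.0, window=1)
```

Every artificial copy keeps the original target, which is exactly why this approach can inject samples that never occur in practice.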

Next we present an overview of the TeNet model. 

3.2 Model Overview 

We propose a convolutional neural network to learn the
snippets from the periodic time-series, as illustrated in
Figure 1. The model has three hidden layers, namely the
temporal embedding layer, the convolution/max-pooling layer,
and the sigmoid layer. The output layer is an $\ell_1$-regularized
least-squares regression layer. The illustrated model is an
example instantiation of the proposed model, with the input
size, embedding window size, number of snippets, snippet
size, max-pooling size and sigmoid layer size set to 6, 1, 2,
(1,3), (1,2) and 3 respectively. The model implements the
following workflow:

1. It takes an input sample, and applies the temporal
embedding. This layer transforms the sample into a
denser representation with not only the sample itself
but also information of its potential temporal neighbors.
The weights of the transformation are iteratively
updated during the training process.

2. The embedded input is sent into a convolution layer
where a set of filters, or snippet detectors, scan through
the sample using the convolution operator. Each snippet
will be convolved against the sample, resulting in
a feature map considered as the snippet’s response to
that sample.

3. The snippets’ responses to the sample, being supposedly
sparse and representative, are input into a sigmoid
layer to combine some of the responses into higher-level
and more abstract representations in lower dimensions.
This transformation also involves a set of weights that
is learned over the training process.

4. Finally the abstract representation of the sample is
used to perform an $\ell_1$-regularized least-squares
regression to obtain the predicted value. The intuition
behind the $\ell_1$ regularization is that if we consider the
previous layer’s output, i.e. the high-level neurons’
responses to the sample, as high-level pattern recognizers’
responses to the signal, a sparse solution will utilize
the most significant responses and hence will be less
sensitive to noise [21].
In the following subsections we discuss the layers
separately in detail. In the rest of the paper, the technical
details of the neural network will be described mostly in vector
forms, and we will use the assumptions and notations listed
in Table 1.


Table 1: Table of Notations

Notation                  Description
$x \in \mathbb{R}^d$      the input time-series of length $d$
$l$                       the layer number
$W^{(l)}$                 the weights for the layer
$b^{(l)}$                 the bias for the layer
$a^{(l)}$                 the input of neurons in the layer
$z^{(l)}$                 the intermediate values for the layer
$f^{(l)}$                 the activation function for the layer
$\delta^{(l)}$            the intermediate error (cost) of the layer
$J(W,b;x,y)$              the network’s cost given $W$, $b$; $x$, $y$
$T$                       the transpose operator
$\cdot$                   the dot product operator
$\odot$                   the element-wise product operator
$*$                       the convolution operator
$f'$                      the derivative of function $f$


3.3 Temporal Embedding

The temporal embedding layer aims to align less dominant
samples to the dominant patterns by reducing the temporal
distortions and misalignment (e.g. shifting or skewed
sequence of events), corresponding to two cases in our previous
example: 1) the commuter starts the day 30 minutes earlier
than usual, so every event in the morning rush hour is shifted
ahead equally by 30 minutes; 2) for some reason the commuter
does not take the usual bus line which stops directly
at the workplace, and instead takes a train and walks 1 km
to work from the station. In the resulting time-series we will
see two distinct effects as a result of 1) and 2). For example,
assume that on a normal day the travel distance time-series
segment in the morning is $v = \langle 0,1,2,4,1,0 \rangle$; then
for case 1 we will have $u = \langle 1,2,4,1,0,0 \rangle$, and in case 2
it will be $u = \langle 0,1,4,2,1,0 \rangle$. Now we assume both cases
happen on the same day, giving us $u = \langle 1,4,2,1,0,0 \rangle$, which
is heavily distorted from $v$. It is a significant challenge for
a prediction algorithm to realize that for the two days the
travel distances should be very similar even though the sequences
and the values of their time-series are so different.

Temporal embedding addresses this issue by optimally
embedding a value’s temporal neighbors into itself, so that
for the whole dataset the dominant pattern remains unchanged
but the distorted patterns are realigned. The layer
is configured by one hyperparameter $d^{te}$ that controls how
many neighbors of an element in each direction should be
embedded into the element itself (the embedding size). This
layer has $2 \times d^{te} + 1$ sets of parameters, represented by
matrices $W_{l_j}^{(1)}$, $W_m^{(1)}$ and $W_{r_j}^{(1)} \in \mathbb{R}^{d \times d}$, and the same number
of constant sparse matrices $\bar{W}_{l_j}^{(1)}$, $\bar{W}_m^{(1)}$ and $\bar{W}_{r_j}^{(1)} \in \mathbb{R}^{d \times d}$.
The subscripts $l$ and $r$ represent the direction of the
neighbors on the timeline, and $j$ here indexes the weights
for the $j$-th neighbor in the final embedding. In the case of
$d^{te} = 1$, there are three $W$ matrices and three $\bar{W}$ matrices in
this layer. The six matrices together implement the embedding
operators. Here we use the input dimensions in Figure 1
(where $d = 6$) as an example for how this layer works.
The constant matrices are defined as:

$$
\bar{W}_l^{(1)} =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix},
\qquad
\bar{W}_r^{(1)} = \bar{W}_l^{(1)T} \qquad (1)
$$

$$
\bar{W}_m^{(1)} = I = \mathrm{eye}(d) \qquad (2)
$$

Weights in $W_l^{(1)}$, $W_m^{(1)}$ and $W_r^{(1)}$ that correspond to the 1s in
$\bar{W}_l^{(1)}$, $\bar{W}_m^{(1)}$ and $\bar{W}_r^{(1)}$ represent the weights for the embedding
of the sample’s left neighbor (forward), the sample itself and
its right neighbor (backward) respectively, and they are
initialized with the corresponding constant matrices.
The layer’s output is subsequently defined as follows:

$$
z^{(1)} = a^{(1)} \cdot \left( W_l^{(1)} \odot \bar{W}_l^{(1)} + W_m^{(1)} \odot \bar{W}_m^{(1)} + W_r^{(1)} \odot \bar{W}_r^{(1)} \right) + b^{(1)} \qquad (3)
$$

$$
a^{(2)} = z^{(1)} \qquad (4)
$$

The element-wise product with the constant matrices enforces
a constraint that the connections between this
layer and its input are restricted, and only the weights at
the desired neighboring positions for each element are used
in the final embedding for that element. The layer yields the
temporal embedded output $z^{(1)} \in \mathbb{R}^{d}$, or $\mathbb{R}^{6}$ in this
example, as the output of the layer. One can also use the
sigmoid function as the activation function in the temporal
embedding layer, though our experiments show that the
difference it makes to the prediction accuracy is insignificant
(most of the time adding the sigmoid activation will slightly
decrease the prediction accuracy).

The layer’s output is a vector of the same size as the input;
however, the embedded sample is now significantly more
robust to temporal distortions. With temporal embedding, the
model detects dominant patterns in the training time-series,
and tries to correct the systematic distortions within the
specified time window. Using the commuter example, the
model will find the person’s regular time for taking the bus
to work, and will try to realign the systematic misalignment
on those unusual days. Some readers may argue that
a simple moving average algorithm might be able to solve
the distortion problem; however temporal embedding is far
more effective, as the concrete example below shows.
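As a rough illustration of the layer's forward pass for an embedding size of 1, the sketch below masks three weight matrices with constant 0/1 matrices and applies them to a row vector. It assumes that the constant matrices are the sub-diagonal, identity and super-diagonal patterns, and that the layer uses an identity activation; the function names are hypothetical.

```python
import numpy as np

def embedding_masks(d):
    """Constant 0/1 masks (assumed layout): sub-diagonal, identity, super-diagonal."""
    mask_l = np.eye(d, k=-1)
    mask_m = np.eye(d)
    mask_r = np.eye(d, k=1)
    return mask_l, mask_m, mask_r

def temporal_embedding(a, W_l, W_m, W_r, b):
    """z = a . (W_l * mask_l + W_m * mask_m + W_r * mask_r) + b, identity activation."""
    mask_l, mask_m, mask_r = embedding_masks(a.shape[0])
    # Element-wise masking keeps only the weights at the neighbouring positions.
    W_eff = W_l * mask_l + W_m * mask_m + W_r * mask_r
    return a @ W_eff + b

d = 6
mask_l, mask_m, mask_r = embedding_masks(d)
a = np.array([0.0, 1.0, 2.0, 4.0, 1.0, 0.0])

# Weights initialised with the constant matrices: each output element is
# the sum of the element and its two neighbours, i.e. [1, 3, 7, 7, 5, 1].
z = temporal_embedding(a, mask_l.copy(), mask_m.copy(), mask_r.copy(), np.zeros(d))
```

Because of the masks, any weight outside the three diagonals has no effect on the output, so each position effectively owns three individual weights.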


Discussion and Case Study.

Recall our example with $v$ and $u$, where $v$ represents the
dominant pattern in the dataset, while $u$ represents a day
that in fact will yield a similar end-of-day result but shows
very distorted patterns in its time-series. Now given the
parameter matrices $W_l$, $W_m$, $W_r$ and the constant matrices
$\bar{W}_l$, $\bar{W}_m$, $\bar{W}_r$ initialized as in Equations (1) and (2), our objective is
to realign $u$ with $v$ by eliminating the distortion, while
keeping $v$ as unchanged as possible, which is effectively
equivalent to solving the minimization problem in
Equation (7):

$$
v^* = v \cdot \left( W_l \odot \bar{W}_l + W_m \odot \bar{W}_m + W_r \odot \bar{W}_r \right) + b \qquad (5)
$$

$$
u^* = u \cdot \left( W_l \odot \bar{W}_l + W_m \odot \bar{W}_m + W_r \odot \bar{W}_r \right) + b \qquad (6)
$$

$$
\arg\min \; \|v^* - v\|^2 + \|u^* - v\|^2 \qquad (7)
$$

where $v^*$ and $u^*$ are the embedded new time-series. By
solving the optimization, the non-zero weights in $W_l \odot \bar{W}_l$,
$W_m \odot \bar{W}_m$ and $W_r \odot \bar{W}_r$ are determined as < 0, 0.61,
0.24, 0.44, 1 >, < 0, -0.22, 0.4, -0.15 > and < 0.66, 0.24, 2.1,
1 > respectively. Now $v^*$ and $u^*$ can be calculated according
to Equations (5) and (6), and we subsequently investigate
how temporal embedding performs in terms of preserving $v$
and realigning $u$ to $v$, compared with the moving average
approach, with $v''$ and $u''$ being the output of $v$ and $u$ under a
moving average of window size 3 ($v''_i = \mathrm{mean}\langle v_{i-1}, v_i, v_{i+1} \rangle$).


Table 2: Temporal Embedding vs. Moving Average

Vector    Value
$v$       < 0, 1, 2, 4, 1, 0 >
$u$       < 1, 4, 2, 1, 0, 0 >
$v^*$     < 0, 1, 2, 4, 1, 0 >
$u^*$     < 0, 1, 2, 4, 1, 0 >
$v''$     < 0.5, 1, 2.3, 2.3, 1.7, 0.5 >
$u''$     < 2.5, 2.3, 2.3, 1, 0.3, 0 >

Pair           Squared Error    Intersection    Pearson’s
$v, u$         4.5              4               0.11
$v, v^*$       0                8               1
$v^*, u^*$     0                8               1
$v, v''$       2                6.3             0.87
$v'', u''$     3.1              5.2             0.02


Table 2 measures the relations between the vectors before
and after the transformations with three metrics, namely
squared error, intersection and Pearson’s correlation. First
we note that $u$ is so distorted that the correlation between
$v$ and $u$ is merely 0.11, which can be considered
“uncorrelated”. Now we examine the differences between the effects
of temporal embedding and moving average.

Ideally, the transformation should show the following
properties: 1) since $v$ represents the reoccurring pattern in the
training set, we want $v^*$ to be as unchanged as possible after
the transformation; 2) after the transformation, $u^*$ should
be as similar to $v^*$ as possible, indicating that the
misalignments in $u$ have been minimized and $u$ is realigned to the
representative sample $v$. We verify the two aspects by
examining the relations between $v$ and $v^*$, and that between
$v^*$ and $u^*$, and observe that temporal embedding has achieved
both goals.

First we observe that $v^*$ is identical to $v$ (with 0 squared
error), while $u^*$ has been transformed to a form that is
identical to $v$, with the dominant values at
the second and third positions swapped and realigned to the
third and fourth positions to be in line with $v$. However,
we can see that moving average resulted in a squared error of 2
between $v$ and $v''$, showing that $v$ has not been preserved
successfully in the transformation. Second, though moving
average does strengthen the relation between $v$ and $u$ by
reducing the squared error (4.5 → 3.1) and by increasing the
similarity by intersection, it has even resulted in a drop in
the correlation (0.11 → 0.02 compared with the original $v$
and $u$). We conclude its result is clearly less successful
compared to temporal embedding (4.5 → 0 in squared error,
4 → 8 in intersection, and 0.11 → 1 in correlation).

It is worth noting that although the temporal embedding
layer in the proposed neural network is not exactly the same
as in Equation (7), as it does not have knowledge initially
about which samples hold the representative patterns, as
the training proceeds, the weights will progressively favor
the reoccurring patterns, and eventually approach the
solution of Equation (7). Next we describe the convolution, the
max-pooling and the sigmoid layers.
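The moving-average side of the case study can be recomputed with a few lines of NumPy. Two details are assumptions made here because they reproduce the reported numbers: at the boundaries the average is taken over the available neighbours only, and the "squared error" column is computed as the Euclidean norm of the difference. The helper names are hypothetical.

```python
import numpy as np

def moving_average(x, window=3):
    """Centred moving average; boundary positions average the available values."""
    x = np.asarray(x, dtype=float)
    half = window // 2
    return np.array([x[max(0, i - half): i + half + 1].mean() for i in range(len(x))])

def euclidean(p, q):
    return float(np.linalg.norm(p - q))

def intersection(p, q):
    return float(np.minimum(p, q).sum())

def pearson(p, q):
    return float(np.corrcoef(p, q)[0, 1])

v = np.array([0, 1, 2, 4, 1, 0], dtype=float)
u = np.array([1, 4, 2, 1, 0, 0], dtype=float)
v_ma, u_ma = moving_average(v), moving_average(u)

for name, p, q in [("v, u", v, u), ("v, v''", v, v_ma), ("v'', u''", v_ma, u_ma)]:
    print(f"{name:10s} {euclidean(p, q):.2f} {intersection(p, q):.2f} {pearson(p, q):.2f}")
```

Under these assumptions the computed values agree with the moving-average rows of the table after rounding; the correlation between $v$ and $u$ comes out at roughly 0.12, against the reported 0.11.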


3.4 Convolution, Max-pooling and Sigmoid 

The convolution/pooling layer performs a series of discrete
1-d convolutions with a specified number of filters
of a specified length. Each of the filters “sweeps”
through the entire input signal and takes the input signal
segment at the corresponding position as input. With a
filter kernel $w = \langle w_1, w_2, \ldots, w_m \rangle$ (taking the
convention of reversely-ordered weights for convolution
kernels and outputs), the filter’s output has the element:

$$
z_i^{(2)} = \sum_{j=1}^{m} w_{m-j+1} \, a_{i+j-1}^{(2)} \qquad (8)
$$

In the example in Figure 1 we have set two filters with size
1x3, hence in the convolution layer, each neuron will only
be connected to three neurons from the temporal embedding
layer. Such sparse connectivity between the filters and their
inputs enforces that the convolution layer focuses
on finding local snippets with moderate lengths.
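A single filter's response can be sketched with NumPy's discrete convolution, which applies the kernel in reversed order as in the convention mentioned above. The kernel values below are arbitrary illustrations rather than learned snippets, and restricting the output to positions where the filter fully overlaps the input ("valid" mode) is an assumption consistent with the layer sizes in Figure 1.

```python
import numpy as np

def filter_response(a, w):
    """Valid-mode 1-d discrete convolution of the input a with the kernel w."""
    return np.convolve(a, w, mode="valid")

a = np.array([0.0, 1.0, 2.0, 4.0, 1.0, 0.0])   # 6-d embedded input
w = np.array([1.0, 0.0, -1.0])                 # illustrative length-3 kernel

out = filter_response(a, w)   # 6 - 3 + 1 = 4 values: [2, 3, -1, -4]
```

Each output value depends on only three consecutive inputs, which is the sparse connectivity the text describes.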

Though the convolution traverses the entire time-series in
a sliding-window style and seemingly has a positive effect in
reducing the temporal distortions, it is very different from
temporal embedding. The main factor differentiating them
is the weight-sharing scheme (see Figure 1). A filter in the
convolution layer has its weights shared among all its output
neurons (meaning a filter slides through the data,
trying to match the same particular pattern), while in temporal
embedding each neuron has individualized weights to
enable optimal local embedding for each position. Such
flexibility enables it to identify and realign much more complex
distortions and misalignments. For example, given
$v = \langle 0,1,2,4,1,0 \rangle$, convolution will not be able to recognize the
close relation between $u = \langle 1,4,2,1,0,0 \rangle$ and $v$ because
of the heavy distortions in both the positions and the
sequences. In the experiments we will also show that without
the temporal embedding layer, a convolutional neural network
does not work well on such time-series.

The output of the convolution is of size (number of filters) x
(input length - filter length + 1). In Figure 1’s example, with
two filters, $d = 6$ and a filter length of 3,
we have 8 neurons in the convolution layer. The output
is then received by the max-pooling layer, where only the
maximal value is kept from any pool of 1 x 2. The filter’s
output is hence down-sampled and transformed by an
element-wise hyperbolic tangent function, reducing the
output to 4 dimensions. Then, as the last hidden layer, the
sigmoid layer performs a projection from the
convolution/pooling output to a further reduced dimension, as a
means of both learning non-linear features and reducing
dimensionality. Finally, the input is transformed into a dense,
robust and representative feature representation of size 1 x 3.
Intuitively we can consider the sigmoid layer as a higher-level
feature learner, after the convolution layer has discovered
those relatively more “local” snippets.
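To make the layer sizes concrete, the following sketch runs one forward pass through a network with the Figure 1 dimensions. It is an illustration under assumptions rather than the authors' implementation: the parameter names, the random initialisation, the per-filter bias and the sub-/super-diagonal mask layout are all choices made here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(d=6, n_filters=2, filter_len=3, pool=2, n_hidden=3, seed=0):
    """Hypothetical initialisation matching the Figure 1 sizes."""
    rng = np.random.default_rng(seed)
    n_pooled = n_filters * ((d - filter_len + 1) // pool)
    return {
        "W_l": np.eye(d, k=-1), "W_m": np.eye(d), "W_r": np.eye(d, k=1),
        "b1": np.zeros(d),
        "filters": rng.normal(scale=0.1, size=(n_filters, filter_len)),
        "b2": np.zeros(n_filters),
        "W3": rng.normal(scale=0.1, size=(n_pooled, n_hidden)),
        "b3": np.zeros(n_hidden),
        "W4": rng.normal(scale=0.1, size=n_hidden),
        "b4": 0.0,
    }

def tenet_forward(x, p, pool=2):
    d = x.shape[0]
    # 1) temporal embedding: masked weights, identity activation
    W1 = (p["W_l"] * np.eye(d, k=-1) + p["W_m"] * np.eye(d)
          + p["W_r"] * np.eye(d, k=1))
    z1 = x @ W1 + p["b1"]
    # 2) one valid 1-d convolution per filter, then max-pooling and tanh
    conv = np.stack([np.convolve(z1, k, mode="valid") for k in p["filters"]])
    conv = conv + p["b2"][:, None]
    n_f, n_out = conv.shape
    pooled = conv[:, : (n_out // pool) * pool].reshape(n_f, -1, pool).max(axis=2)
    feat = np.tanh(pooled).ravel()
    # 3) fully-connected sigmoid layer
    hidden = sigmoid(feat @ p["W3"] + p["b3"])
    # 4) linear regression output
    y_hat = float(hidden @ p["W4"] + p["b4"])
    return {"z1": z1, "conv": conv, "pooled": pooled, "hidden": hidden, "y_hat": y_hat}

out = tenet_forward(np.array([0.0, 1.0, 2.0, 4.0, 1.0, 0.0]), init_params())
print({k: np.shape(val) for k, val in out.items()})
```

The intermediate shapes follow the sizes listed for Figure 1: 1x6 after embedding, 2x4 after convolution (8 neurons), 2x2 after pooling, 1x3 after the sigmoid layer and a scalar prediction.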


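3.5 $\ell_1$-regularized Least-squares

The output layer of the proposed model is an $\ell_1$-regularized
least-squares regression layer, defined as:

$$
z^{(4)} = a^{(4)} \cdot W^{(4)} + b^{(4)} \qquad (9)
$$

with the cost function in the form of:

$$
J(W,b;x,y) = \frac{1}{2} \left\| a^{(4)} \cdot W^{(4)} + b^{(4)} - y \right\|^2 + \lambda \left\| W^{(4)} \right\|_1 \qquad (10)
$$

where $\lambda$ is a hyperparameter for the weight of the
regularization term.

The advantage of using the $\ell_1$ regularizer over $\ell_2$ is that
the $\ell_1$ regularizer forces the optimization to find a sparse
solution that only uses the most distinctive high-level features
to produce the final prediction [21, 25]. With the $\ell_2$
regularizer the weights tend to have smaller variance, often making
the model spread the energy thinly across all features, hence
making the model less distinctive and less accurate.


3.6 Backpropagation

The parameters in the network are updated by stochastic
gradient descent. In particular, $W^{(4)}$ can be learned by:

$$
\frac{\partial J(W,b;x,y)}{\partial W^{(4)}} = a^{(4)T} \left( z^{(4)} - y \right) + \lambda \, \mathrm{sign}(W^{(4)})
$$

where sign() is the sign of a vector. One can speed up this
optimization process using the methods proposed in [28].

To update the parameters in the temporal embedding
layer, taking $W_l^{(1)}$ as an example, we apply the chain rule
and arrive at:

$$
\frac{\partial J(W,b;x,y)}{\partial W_l^{(1)}}
= \frac{\partial J(W,b;x,y)}{\partial a^{(2)}} \cdot \frac{\partial a^{(2)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial W_l^{(1)}}
= \frac{\partial J(W,b;x,y)}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial W_l^{(1)}} \qquad (11)
$$

Since the element-wise product has the property:

$$
\left[ a^{(1)} \cdot \left( W_l^{(1)} \odot \bar{W}_l^{(1)} \right) \right]_j = \sum_i \left( a_i^{(1)} \, \bar{W}_{l,ij}^{(1)} \right) W_{l,ij}^{(1)} \qquad (12)
$$

we have the partial derivative of $z^{(1)}$ w.r.t. $W_l^{(1)}$ as:

$$
\frac{\partial z^{(1)}}{\partial W_l^{(1)}} = a^{(1)} \odot \bar{W}_l^{(1)} \qquad (13)
$$

We calculate the error propagated from layer 2 to layer 1 as:

$$
\delta^{(1)} = \frac{\partial J(W,b;x,y)}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial a^{(2)}} \cdot \frac{\partial a^{(2)}}{\partial z^{(1)}}
= \sum_{i=1}^{n} \delta_i^{(2)} * \mathrm{flip}(W_i^{(2)}) \qquad (14)
$$

where $n$ is the number of filters and flip() returns the input
vector in reversed order. With the convolution layer’s
back-propagated error $\delta^{(2)}$ (which can be calculated by the
method described in [16]), $W_l^{(1)}$ can therefore be updated
with the gradient:

$$
\frac{\partial J(W,b;x,y)}{\partial W_l^{(1)}} = \left[ \sum_{i=1}^{n} \delta_i^{(2)} * \mathrm{flip}(W_i^{(2)}) \right] \odot \left( a^{(1)} \odot \bar{W}_l^{(1)} \right) \qquad (15)
$$

$W_m^{(1)}$ and $W_r^{(1)}$ can be updated using similar procedures.
Meanwhile, $b^{(1)}$ is updated with the gradient:

$$
\frac{\partial J(W,b;x,y)}{\partial b^{(1)}} = \sum_{i=1}^{n} \delta_i^{(2)} * \mathrm{flip}(W_i^{(2)}) \qquad (16)
$$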
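The output-layer update can be checked numerically with a small sketch. It assumes a cost of the form $\frac{1}{2}(z - y)^2 + \lambda \|W\|_1$ for a single sample and uses the sign function as a subgradient of the $\ell_1$ term; the function names, the learning rate and the example values are illustrative assumptions.

```python
import numpy as np

def cost(W, b, a, y, lam):
    """0.5 * (a.W + b - y)^2 + lam * ||W||_1 for a single sample."""
    z = a @ W + b
    return 0.5 * (z - y) ** 2 + lam * np.abs(W).sum()

def grad_W(W, b, a, y, lam):
    """(Sub)gradient of the cost with respect to the regression weights."""
    z = a @ W + b
    return a * (z - y) + lam * np.sign(W)

def sgd_step(W, b, a, y, lam, lr):
    """One stochastic gradient descent step on W and b."""
    z = a @ W + b
    return W - lr * grad_W(W, b, a, y, lam), b - lr * (z - y)

a = np.array([0.2, 0.5, 0.9])     # output of the sigmoid layer (1 x 3)
W = np.array([0.3, -0.4, 0.1])
b, y, lam = 0.05, 1.0, 0.1

W_new, b_new = sgd_step(W, b, a, y, lam, lr=0.1)
print(cost(W, b, a, y, lam), cost(W_new, b_new, a, y, lam))
```

Away from zero weights the analytic gradient matches a finite-difference estimate, and a small step lowers the cost for this example.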


Next we present the experimental results and offer
in-depth analysis and discussion.


4. EXPERIMENTS 

In the experiments, we conduct extensive tests on the
proposed model, with 15 individual datasets and 4 competitive
methods. The goals of the experimental studies are fourfold:
1) to evaluate the prediction performance of the proposed
model, in terms of prediction accuracy, and compare it with
the competitive models; 2) to evaluate the model’s behavior
and sensitivity to features of diverse datasets; 3) to
investigate the isolated effects of temporal embedding; and 4) to
visualize the snippets and show how they work with
intermediate values from the learning process.


4.1 Datasets 

To support the comprehensive evaluation, we use a variety
of univariate, periodical time-series datasets that represent
three modalities, ranging from human mobility patterns to
household power consumption. The reason we choose these
modalities is that the behaviors they represent are expected
to exhibit complex periodical patterns in daily cycles, which
makes them an ideal testbed for the proposed model to demonstrate
its capability of discovering and capturing such abstract
features and to test its robustness to various factors.

The first modality is Human Mobility - daily traveling
Distance (HMD) in kilometers, and the second is Human
Mobility - daily traveling Time (HMT) in minutes. Both
modalities are extracted from the LifeMap dataset, which contains
human mobility traces collected from eight individuals,
spanning from a few months to around two years. In total
there are 52,819 position fixes, most of which come from
regular sampling every two to five minutes. HMD is the total
displacement of an individual in a day, and HMT is accumulated
from short-term movements as follows:
for each five-minute interval, if the individual’s displacement
is higher than 500 meters^1, then the five-minute period is
counted as “traveling” and is added to the daily total
traveling time.
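The HMT rule can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function name `daily_travel_minutes` and the input format (one displacement in meters per five-minute interval) are assumptions made for the example.

```python
def daily_travel_minutes(displacements_m, threshold_m=500.0, interval_min=5):
    """Accumulate daily traveling time from per-interval displacements.

    displacements_m: displacement (in meters) for each five-minute
    interval of one day (hypothetical input format).
    An interval counts as "traveling" when its displacement is
    strictly higher than threshold_m.
    """
    traveling = sum(1 for d in displacements_m if d > threshold_m)
    return traveling * interval_min


# Three of the five intervals exceed 500 m, so 15 minutes are counted.
example_day = [0.0, 620.0, 1500.0, 480.0, 501.0]
print(daily_travel_minutes(example_day))
```

HMD would be obtained in the same pass by summing the displacements instead of counting the intervals above the threshold.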

The third modality is daily Household Power Consumption
(HPC). Two datasets are used for this modality, i.e.,
household power consumption datasets from France^2 and
Australia^3 (HPC-FR, HPC-AU). HPC-FR consists of
2,075,259 active power consumption readings in watts, sampled every
minute for 48 months from a single household. HPC-AU
consists of 618,189 household power meter readings in kilowatt-hours,
sampled every 30 minutes from 31 households for up
to 29 months.

To prepare the data, we developed a program to extract 
only the samples that have complete (or nearly complete) 
day cycles, meaning that every data sample used must have 
regular readings in each period of time in a complete day. To 
obtain meaningful results, only individuals with more than 
150 days of records are used in the experiments. 

^1 Median errors of localization with assisted GPS, WiFi positioning and cellular
network positioning are reported to be 8, 74 and 600 m [33].

^2 https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption
^3 http://data.gov.au/dataset/sample-household-electricity-time-of-use-data
Table 3: Performance comparison. HitRate cells are HR@20%|HR@30% (in %); Error cells are MRE|MSE.

                          HitRate (%, @20%|@30%)                    Error (MRE|MSE)
ID  n    range      σ     TeNet   SVLN    SVSIG   SVPOLY  MKR     | TeNet      SVLN       SVSIG      SVPOLY     MKR

Household Power Consumption - Australia (HPC-AU)
8   874  3.1-36.4   5.6   71|89   55|74   59|83   60|82   48|75   | 0.16|7.7   0.24|14.4  0.23|10.5  0.23|9.5   0.29|17.9
15  870  1.5-30.9   5     65|84   58|78   56|79   59|80   53|75   | 0.2|11.2   0.25|14.2  0.26|15    0.24|13.6  0.26|18.5
14  670  1.7-59.2   7.2   65|79   42|64   40|60   50|72   63|77   | 0.21|10    0.26|13.7  0.31|15.3  0.24|11.1  0.31|10
7   665  8.4-38.4   5.0   75|92   67|88   73|90   75|91   72|89   | 0.14|15.9  0.16|17.1  0.15|15.8  0.14|14.4  0.16|15.7
5   661  0.2-8.0    1.6   71|82   58|76   57|76   58|76   70|82   | 0.84|1     1.17|1.82  1.1|1.7    1.1|1.7    0.39|0.9
12  243  7.6-27.2   2.8   90|97   88|96   90|96   85|92   78|91   | 0.09|4.6   0.1|4.6    0.09|4.8   0.12|8.2   0.17|8
10  242  4.3-42.6   7.7   84|96   74|92   78|94   70|90   71|83   | 0.12|8     0.17|7.5   0.13|9.2   0.12|10.6  0.16|7.15
1   241  8.9-37.9   4.6   84|96   80|95   75|92   76|91   76|93   | 0.12|12    0.12|11.2  0.14|14    0.15|22    0.13|9.23
13  241  4.4-46.7   6.3   65|82   62|78   59|79   63|80   59|80   | 0.19|22    0.2|22.3   0.2|26     0.21|23    0.24|18.7
29  233  17.3-73.5  11.3  77|93   75|92   59|80   77|93   76|92   | 0.14|43    0.13|31    0.2|85.0   0.13|34    0.15|37.6

Household Power Consumption - France (HPC-FR)
1   161  10-79.5    10.3  64|83   56|75   53|73   60|75   63|81   | 0.18|74.6  0.23|111   0.26|110   0.22|100   0.2|67

Human Mobility - Traveling Distance (HMD)
8   206  8-99       15.2  46|63   48|62   35|50   48|63   33|45   | 0.28|169   0.29|170   0.35|198   0.31|300   0.4|323
12  156  9.5-60     11    59|83   56|77   43|65   54|70   45|67   | 0.20|74.5  0.23|101   0.28|96    0.27|285   0.27|103

Human Mobility - Traveling Time (HMT)
8   193  55-345     47    51|70   47|64   42|58   40|57   45|66   | 0.23|32.1  0.24|36.6  0.28|59    0.35|179   0.25|43
12  243  37-280     32.4  61|74   58|70   48|67   49|68   57|74   | 0.21|24.0  0.23|29.4  0.25|29    0.3|60     0.22|25



For the human mobility datasets, we use the two individuals’
datasets with the highest data quality in terms of
timespan (>150 days) and sampling frequency. We extract
the traveling distances and traveling times for each interval
(e.g., a 30-minute interval creates a 48-dimensional time-series for a
day), and use the resulting time-series for the experiments.
Similar preprocessing is applied to the power consumption
datasets. After preprocessing, each time-series sample has d
elements, $x = \langle x_1, \ldots, x_d \rangle$, where each $x_i$ is the value
that occurred in the corresponding time interval (non-cumulative).

For each individual dataset, we randomly divide the samples
equally into three folds: the training set, the validation
set and the test set. The model is trained using the training
set, and is then tested on the validation set. Such cross-validation
is performed on the same individual dataset
five times with random splits, and the reported performance
is the value averaged across the five iterations. The settings
of hyperparameters with the best validation performance are
kept as the hyperparameters of the model. Finally we test
the model on the test set and report the performance.
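The splitting protocol can be sketched as below. This is an illustrative reading of the paragraph, not the authors' program: the helper names `three_way_split` and `repeated_splits`, the use of consecutive integer seeds, and the decision to give any remainder to the test fold are all assumptions.

```python
import random


def three_way_split(samples, seed):
    """Randomly divide samples into three (nearly) equal folds."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    third = len(shuffled) // 3
    train = shuffled[:third]
    valid = shuffled[third:2 * third]
    test = shuffled[2 * third:]  # takes any remainder
    return train, valid, test


def repeated_splits(samples, repeats=5):
    """Yield one random three-way split per repeat (seeds are arbitrary)."""
    for seed in range(repeats):
        yield three_way_split(samples, seed)


days = list(range(243))  # stand-in for 243 daily time-series samples
for train, valid, test in repeated_splits(days):
    print(len(train), len(valid), len(test))
```

Scores from the five splits would then be averaged to obtain the reported figure.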

4.2 Evaluation Settings 

For evaluation we consider the periodical accumulation
prediction problem, where each input $x' \in \mathbb{R}^{d'}$ ($d' < d$) is a
head segment of a complete x and corresponds to a target
value $y = \sum_{i=1}^{d} x_i$ representing the periodical accumulation.

Clearly the model can be used to perform other types of prediction
such as time-series forecasting or k-ahead prediction.
Due to space limits, here we use periodical accumulation prediction
as a showcase for TeNet’s performance advantages.
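As a small sketch of how one training pair might be built for this task: the head segment is the first $d'$ intervals of a day, and the target is taken here to be the sum over the whole day, which is our reading of "periodical accumulation". The function name `make_example` and the toy data are hypothetical.

```python
def make_example(day_series, head_len=28):
    """Return (head segment x', daily accumulation y) for one day.

    day_series: the d non-cumulative interval values of a complete day.
    head_len:   d', the number of leading intervals visible to the model.
    """
    if not 0 < head_len < len(day_series):
        raise ValueError("head_len must satisfy 0 < d' < d")
    head = list(day_series[:head_len])
    target = sum(day_series)
    return head, target


# A toy 48-interval day: one unit in every interval from 08:00 to 18:00.
day = [1.0 if 16 <= i < 36 else 0.0 for i in range(48)]
x_head, y = make_example(day)
print(len(x_head), sum(x_head), y)
```

In the toy day the model sees 12 of the 20 active intervals and has to predict the full-day total of 20.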

TeNet is implemented in Python with the Theano framework^4.
For comparison, we consider four competitive methods,
namely Support Vector regression with Linear kernel
(SVLN), Support Vector regression with Radial Basis kernel
(SVSIG), Support Vector regression with Polynomial kernel
(SVPOLY), and Multiple Kernel Regression (MKR) [24].

For the SV-family, we carefully tune the parameters
ε (error margin), d (degree of the kernel function), and
γ (kernel coefficient). Each parameter’s value is selected from the sets
$\epsilon \in \{10^{-5}, 10^{-4}, \ldots, 1, \ldots, 10^{4}, 10^{5}\}$, $d \in \{1, 2, 3\}$ and
$\gamma \in \{10^{-5}, 10^{-4}, \ldots, 1, \ldots, 10^{4}, 10^{5}\}$ respectively, so in total there are 363
combinations for each model. For each test run, during training
we iterate through every combination of the candidate
values of ε, d and γ, keep the values that yield the highest
accuracy on the validation set, then use these parameters on
the test set and report the results. For a comparable evaluation
against MKR, we use an offline implementation where
test samples are not used to update the parameters, and the
number of support vectors is set to 120 to match the
parameter size of TeNet. The hyperparameter selection of
TeNet follows the same procedure. We provide more details
in Section 4.6.2.

^4 http://deeplearning.net/software/theano/
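The exhaustive search over the SV-family parameters can be sketched as follows. The candidate lists assume eleven powers of ten from 10^-5 to 10^5, which is what the stated total of 363 combinations (11 × 3 × 11) implies; `validation_score` and `toy_score` are hypothetical placeholders for training a regressor and measuring its validation accuracy.

```python
from itertools import product

# Candidate values: eleven powers of ten for the error margin and the
# kernel coefficient (assumed range), and three kernel degrees.
epsilons = [10.0 ** k for k in range(-5, 6)]
degrees = [1, 2, 3]
gammas = [10.0 ** k for k in range(-5, 6)]


def grid_search(validation_score):
    """Return the (epsilon, degree, gamma) with the highest score.

    validation_score is a caller-supplied function (hypothetical here)
    that trains a model with the given parameters and returns its
    accuracy on the validation set.
    """
    grid = product(epsilons, degrees, gammas)
    return max(grid, key=lambda params: validation_score(*params))


n_combinations = len(epsilons) * len(degrees) * len(gammas)
print(n_combinations)  # 11 * 3 * 11 = 363


# Toy scoring function standing in for real training and validation.
def toy_score(eps, deg, gam):
    return -abs(eps - 0.1) - abs(deg - 2) - abs(gam - 1.0)


print(grid_search(toy_score))
```

The selected triple would then be applied once to the test set, as described above.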

For most of the experiments the input length d (i.e., $d'$ above) is set to 28,
meaning that for each day, the time-series up to 2 pm is known to the model.
We select this particular number because humans
rarely remain active from 12 am to 4 am and the values
in that period are almost all zeros, so the first 28 dimensions represent
information from exactly half of the active period, which runs from 4 am to
12 am of the next day. This setting is challenging in the sense
that the gap between 2 pm and 12 am of the next day is substantial
and leaves numerous possible outcomes for the daily accumulation.
The complexity involved hence provides insight
into how well the proposed and the competitive models can
capture an individual’s daily patterns and make predictions
from limited information.

Next we present the experimental results for the proposed
method and the competitive methods, and also offer in-depth
discussion about hyperparameter tuning and about
the effect of temporal embedding.

4.3 Prediction for Periodical Accumulation 

Table 3 reports the prediction performance of the proposed
method and four competitive methods on 15 individual
datasets of three different modalities, evaluated by
average HitRate (HR) @20% and @30%, Mean Squared Error
(MSE) and Mean absolute Relative Error (MRE). We use four
metrics because, for datasets with long-tailed values
(which human behaviors can often be characterized as having
[11]), MSE alone, as an absolute measurement, is not an
ideal metric for evaluating a regression method’s performance:
it is heavily biased by samples in the long tail [31, 32].
Therefore we mainly use relative measures for the evaluation
while keeping MSE as a reference.
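The metrics can be sketched as below. The formulas are assumptions where this section does not spell them out: HitRate@k is taken to be the percentage of predictions whose absolute relative error is at most k, and MRE the mean of the absolute relative errors. The sample values are made up.

```python
def hit_rate(y_true, y_pred, tolerance):
    """Percentage of predictions whose relative error is within tolerance."""
    hits = sum(
        1 for t, p in zip(y_true, y_pred) if abs(p - t) <= tolerance * abs(t)
    )
    return 100.0 * hits / len(y_true)


def mean_relative_error(y_true, y_pred):
    """Mean absolute relative error (MRE)."""
    return sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Mean squared error (MSE)."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 20.0, 40.0, 100.0]
y_pred = [11.0, 25.0, 30.0, 160.0]
print(hit_rate(y_true, y_pred, 0.20), hit_rate(y_true, y_pred, 0.30))
print(mean_relative_error(y_true, y_pred), mean_squared_error(y_true, y_pred))
```

In this toy case the single large-valued sample contributes 3600 of the 3726 total squared error, which illustrates why MSE is dominated by the long tail while the relative measures are not.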
The highlighted numbers in red, black, magenta and blue
indicate the winning performance on that dataset under the
corresponding metric (magenta → HR@20%, blue → HR@30%,
red → MRE, black → MSE). Multiple highlighted numbers with
the same color in a row indicate multiple winners under that
metric on that dataset. We also report some of the properties,
i.e., the total number of samples n, the numeric range
and the standard deviation σ, for each individual dataset. A
closer look at these dataset statistics suggests large variety
in terms of number of samples (from 156 to 874), numerical
ranges (0.2 to 345) and variances (σ from 2.8 to 47).
To present the reader with more intuitive and meaningful
results, the numbers shown are unnormalized.



Figure 2: Mean Average Performance

Figure 3: The correlation between TeNet’s performance
advantage and sample size/data complexity. (a) n vs. adv. MRE;
(b) entropy vs. adv. MRE

Generally, the distribution of the highlighted, winning
performances shows that TeNet achieves the best results
in most cases, with a few non-systematic exceptions
spread across the competitive methods. Out of the 15
individual datasets, TeNet wins 14 entries in HR@20%,
15 entries in HR@30%, 13 entries in MRE, and 7 entries in
MSE, showing superior performance among the evaluated
models. SVLN and SVSIG show the least competitive results,
with 1, 0, 2, 1 and 1, 0, 1, 0 winning performances
respectively. SVPOLY obtains slightly better results with
3, 2, 3, 0 wins. MKR, on the other hand, shows comparable
results in MSE but far less competitive results in the
other metrics, with 0, 0, 1, 8 wins. In addition, we find
that MKR is less robust to larger numerical ranges such as
those in HMD-8, HMD-12, HMT-8, and HMT-12, while TeNet
demonstrates consistent performance across all datasets.

To compare the methods quantitatively, we plot Figure 2,
which shows each method’s mean average scores across all
individual datasets (MSE is normalized by the maximum
MSE among the methods in each entry). On the 15 individual
datasets, TeNet achieves the best average performance
under all four metrics. Taking a TeNet vs. all approach, we
find that TeNet’s performance and the average of the other methods’
performance under HR@20%, HR@30%, MRE and MSE are
69 vs. 60, 84 vs. 78, 0.22 vs. 0.27 and 34 vs. 51 respectively,
showing that TeNet makes a relative improvement of
15%, 8%, 19% and 33% under the corresponding
metrics. If we then compare TeNet with the best among
the rest, with HR@20% of 69 and HR@30% of 84, TeNet beats
the second best HR@20% of 61 (SVLN, SVPOLY) by 8 and the
second best HR@30% of 78 (SVLN, SVPOLY) by 6; on MRE
and MSE, TeNet’s average errors are 0.22 and 34, while the
second best are 0.24 and 40 (MKR). Hence across all 15 individual
datasets, on average TeNet marks a 13% increase
in HR@20%, an 8% increase in HR@30%, a 9.1% decrease
in MRE and a 15% decrease in MSE relative to the second best
method under each corresponding metric. We also observe
that although TeNet obtains the best performance under
HR@30% on all 15 individual datasets, its average winning
margin there is smaller than under the other metrics. This
is because HR@30% is a relatively looser measurement than
the other metrics, so less accurate predictions tend to
receive similar scores. However,
the consistent advantage of TeNet in not only HR@30% but
all four metrics still suggests that it has the best prediction
accuracy. We hence conclude that TeNet has shown consistent
advantages which are robust to variations in the data
modality as well as in the statistical characteristics of the data.
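The "TeNet vs. all" percentages can be checked with a few lines of arithmetic. The sketch below assumes the improvement is measured relative to the average of the other methods (the denominator is our assumption); with that choice it reproduces the reported 15%, 8%, 19% and 33%.

```python
def relative_gain(tenet, other, higher_is_better):
    """Relative improvement of TeNet over a reference value, in percent."""
    if higher_is_better:
        return 100.0 * (tenet - other) / other
    return 100.0 * (other - tenet) / other


# (TeNet average, average of the other methods, higher is better?)
averages = {
    "HR@20%": (69, 60, True),
    "HR@30%": (84, 78, True),
    "MRE": (0.22, 0.27, False),
    "MSE": (34, 51, False),
}
for metric, (tenet, others, higher) in averages.items():
    print(metric, round(relative_gain(tenet, others, higher)))
```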



Figure 4: The effect of the kernel function degree d

We further examine TeNet’s ability to scale up its learning
effectiveness with a growing sample size or an increasing
complexity of the data. Taking MRE as an example, we measure
two correlations using Pearson’s correlation coefficient:
1) the correlation between the averaged performance advantage
and the sample size, and 2) the correlation between the
averaged performance advantage and the entropy, for each
individual dataset. The measurements yield correlation
coefficients of 0.7 and 0.79 respectively, suggesting a strong
correlation between each pair of variables. Such patterns mean
that as the sample size or the complexity of the data grows,
TeNet is able to learn more effectively than the other methods
and achieve better performance. The correlations are also
visually identifiable when we plot the performance advantage
ratios in Figure 3.
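For reference, Pearson's correlation coefficient can be computed as in the following sketch. The input values are made up for illustration and are not the per-dataset advantages from the experiments.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up illustration values, not taken from the experiments.
sample_sizes = [150, 250, 650, 850]
advantages = [0.02, 0.03, 0.05, 0.08]
print(pearson(sample_sizes, advantages))
```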

4.4 The Effect of d 

Figure 4 illustrates the effect of the feature dimensionality
d on the prediction accuracy. Here we use HPC-AU-8 as
a case study. Figure 4(a) shows the changes of MRE and
normalized MSE with a growing d. Unsurprisingly, both errors
decrease monotonically as d increases, from 1, 0.35 at d = 8
to 0.08, 0.07 at d = 44. Figure 4(b) depicts how the HR
responds to a growing d. Again, we see almost monotonic growth
(the exception being d = 16) in HR@20% and HR@30%. These
results confirm that TeNet can effectively use the additional
information while being little affected by
the noise in the additional dimensions.


Figure 5: Visualization of a set of snippets learned
by TeNet (corresponding to the convolution filters).
(a) Random Snippets; (b) Learned Snippets

4.5 The Effect of Temporal Embedding 

In Section 3.3 we discussed how, hypothetically, temporal
embedding would boost the performance of the model
by automatically realigning the distorted time-series to the
dominating patterns in a dataset, and verified it with a case
study on a synthetic example. To further validate this hypothesis
on real data, we create a designated dataset from
HPC-AU-8 by performing the following procedure:

1. We run a clustering with the affinity propagation method
and find the top 10 exemplars.

2. We take the exemplars and generate 300 synthetic samples
(30 for each exemplar) by distorting the exemplars
with randomly selected operations such as swapping
two neighboring segments or shifting the data forward
and backward. They are equally split into training,
validation and test sets.

3. We train a modified classical convolutional
neural network for regression (CNN, input →
convolution/pooling → sigmoid → l1-linear regression)
without temporal embedding, and a TeNet model,
and examine the performance differences.
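The distortion step of the procedure above can be sketched as follows. The segment length, the maximum shift, the zero padding and the equal probability of the two operations are all assumptions for illustration; the text only names the two kinds of operation.

```python
import random


def swap_neighboring_segments(series, start, seg_len):
    """Swap two adjacent segments of length seg_len beginning at start."""
    out = list(series)
    mid, end = start + seg_len, start + 2 * seg_len
    if end > len(out):
        raise ValueError("segments exceed the series length")
    out[start:end] = out[mid:end] + out[start:mid]
    return out


def shift(series, offset, fill=0.0):
    """Shift forward (offset > 0) or backward (offset < 0), padding with fill."""
    n = len(series)
    out = [fill] * n
    for i, value in enumerate(series):
        j = i + offset
        if 0 <= j < n:
            out[j] = value
    return out


def distort(exemplar, rng, seg_len=2, max_shift=2):
    """Apply one randomly selected distortion to an exemplar."""
    if rng.random() < 0.5:
        start = rng.randrange(0, len(exemplar) - 2 * seg_len + 1)
        return swap_neighboring_segments(exemplar, start, seg_len)
    offset = rng.choice([k for k in range(-max_shift, max_shift + 1) if k != 0])
    return shift(exemplar, offset)


rng = random.Random(0)
exemplar = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 0.0, 4.0]
synthetic = [distort(exemplar, rng) for _ in range(30)]  # 30 per exemplar
print(len(synthetic))
```

Repeating this for each of the 10 exemplars would give the 300 synthetic samples.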

The results are reported in Table 4. We observe that with
the temporal embedding layer, the errors are reduced
by more than half (MSE from 15.5 to 6.4, MRE from 0.34 to
0.12), and the HitRate improves by about 100% and 40% at the
20% and 30% thresholds respectively. This shows that temporal embedding
is able to learn weights that are conceptually equivalent
to a reverse operation for the distortions and misalignments.
Table 4: Performance with and without temporal embedding

         HR@20%   HR@30%   MRE    MSE
CNN      38       66       0.34   15.5
TeNet    75       93       0.12   6.4


4.6 Discussion 

4.6.1 Distinctive Snippets 
We present a visualization of the random snippets and
learned snippets for the first cross-validation iteration on
HPC-AU-8 in Figure 5. Each cell is a snippet, a segment
of time-series the model deems representative. The figures
show some noteworthy properties. Firstly, the random snippets
are fairly dense, while the learned ones are much
sparser, meaning that in most cases there are only a small
number of spikes and valleys in each learned snippet. Secondly,
the sparsity of the learned snippets is accompanied
by a visually identifiable high distinctiveness across the
learned snippets, which means the learned snippets tend to be
different from one another because they effectively capture
different patterns in the training data. Both properties suggest
that the snippets are truly learning from the patterns
in the dataset, and both properties have a positive effect on
the model’s prediction accuracy.

4.6.2 Selection of Hyperparameters 
As an issue often posed to complex learning models including
neural networks, how to select the hyperparameters
is an open question studied by many [15]. There are six
hyperparameters in the proposed model:

Table 5: Hyperparameters and selection candidates

Notation   Description                    Candidates
di         filter size                    {3,5,7}
           no. kernels in conv. layer     {20,30,40,60}
           learning rate                  {0.01,0.02}
dP         temporal embedding step        {1,2}
n(3)       no. outputs in sigmoid layer   {12,16}
λ          weight for the l1-term         {0.1,0.01,0.001}


In this paper, since the sizes of the datasets are moderate,
we use an intuitive approach to find the hyperparameters
for the testing. The selection and testing processes follow
those described in the third paragraph of Section 4.2. One
can also use a greedy hyperparameter selection process.
We also used two optional data preprocessing techniques,
i.e., high pass filtering to denoise, and data shifting
to synthesize more training data. The activation of each
technique is subject to a control parameter which is tuned
using the same process.

Note that since all the hidden nodes in layers 2 and 3 output
small values only, with the settings we used for the experiments,
the regression layer’s ability to predict larger numbers (e.g.,
>1000) is limited. To predict larger numbers, one can consider
either rescaling the data or setting a smaller λ to adjust
to the numerical range of the specific dataset.

4.6.3 Network Depth and Number of Parameters 

The proposed model has a moderate number of layers
(four if we count the convolution/pooling as one), and hence
a moderate number of parameters to estimate. For example,
with d = 28, a temporal embedding step of 1, the number of
kernels and the filter size set to 20 and 5, and $n^{(3)} = 12$, we have:

$$\sum \{|W^{*}|, |b^{*}|\} = (3 \times 28 + 28) + (20 \times 5 + 20) + (240 \times 12 + 12) + (12 + 1) = 3137 \qquad (17)$$
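The count in Eq. (17) can be reproduced term by term. In the sketch below, the convolution output length (d - filter size + 1) and a pooling factor of 2 are assumptions chosen because they yield the 240 inputs of the sigmoid layer; the function name is hypothetical. The four bracketed terms evaluate to 112, 120, 2892 and 13.

```python
def tenet_parameter_count(d_in=28, n_embed=3, n_kernels=20, filter_size=5,
                          pool=2, n_sigmoid=12):
    """Count weights and biases layer by layer.

    The conv output length (d_in - filter_size + 1) and the pooling
    factor of 2 are assumptions chosen to reproduce the 240 inputs of
    the sigmoid layer in Eq. (17).
    """
    embedding = n_embed * d_in + d_in
    convolution = n_kernels * filter_size + n_kernels
    pooled = n_kernels * ((d_in - filter_size + 1) // pool)
    sigmoid = pooled * n_sigmoid + n_sigmoid
    regression = n_sigmoid + 1
    return embedding, convolution, sigmoid, regression


terms = tenet_parameter_count()
print(terms, sum(terms))
```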

It is possible to add more layers to construct a deeper ar¬ 
chitecture based on temporal embedding and convolution. 
However, the data itself must be complex enough to provide 
more potential for the model to exploit. Given the granu¬ 
larity of daily human behaviors, for the task of predicting 
modalities such as traveling distance/time and power con¬ 
sumption, a deeper architecture has only limited effect. 

5. CONCLUSION 

Motivated by the observation that regularities in periodical
time-series sometimes manifest at different moments
and at varied paces, in this paper we propose a technique
called temporal embedding and devise a convolutional neural
network-based learning model called TeNet, which is robust
to temporal distortions and misalignments, to learn
abstract features. First we present TeNet and discuss the
intuition behind it using a case study, and then describe
the technical details of the whole network architecture and
solve the backpropagation problem for the proposed model.
In the experiments we use an extensive range of real-life periodical
data that covers three modalities to compare the performance
of the proposed model against competitive methods.
We find that on average TeNet achieves an 8% to 33%
advantage over the other methods under different metrics, and
that the advantage scales up with a growing sample size used in
training. We also find that the accuracy of TeNet increases
almost monotonically with a growing d, indicating that the model
is effective in utilizing more information while remaining
robust to noise. We also create a set of synthetic data
from the real-life data to demonstrate the effect of temporal
embedding and successfully show its capability of realigning
distorted and misaligned data. At the end of the experiments
we also offer an in-depth discussion about hyperparameter
selection, data preprocessing, network depth and number of
parameters, and present a visualization of the learned snippets.
Beyond the periodical accumulation prediction problem,
we expect TeNet to be useful for general time-series
predictions ranging from forecasts to k-ahead prediction.